This example is a little less refined in terms of reporting, but I would like to have it present to demonstrate ability. It is a sample of code, sometimes with little explanation for each code chunk. The purpose of this project was to forecast COVID-19 cases for the "next" 7 days and 90 days (from when I turned it in!). The models that were developed performed quite well over the 90-day window, as the actual case counts for those dates fall within the prediction intervals from these models.
# read data
df2 <- read_csv("data/all-states-history.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## date = col_date(format = ""),
## state = col_character(),
## dataQualityGrade = col_character()
## )
## See spec(...) for full column specifications.
# head(df2)
# create positivity rate column
df2$pos_rate <- round((df2$positiveIncrease/df2$totalTestResultsIncrease) * 100, 2)
df2$pr_comp <- round((df2$positive/df2$totalTestResults) * 100, 2)
The positivity rate could be calculated in one of two ways, as seen above. One could take the daily positivity rate by dividing the number of positive tests in a day by the total number of tests performed that day, or the comprehensive overall rate using cumulative totals since the beginning of the time series. This analysis uses the latter method, which provides a more stable metric over the course of the series. The positivity rate alone does not seem to be a good metric for judging the severity of the pandemic: it depends on a great many factors, and relying on this statistic by itself would be misleading. One issue noticed was days when the dataset records only a small number of tests, for example because reporting was delayed or it was a holiday. On one of these slow days, if only 50 tests were reported and 25 of them were positive, the situation would appear very severe. Another thing to keep in mind is the accessibility of COVID testing, or possibly social resistance to it. In certain parts of the country, more people may be willing to be tested without symptoms, so a great many healthy people get tested "just in case". In other parts of the country, or even within the same state, there may be resistance to testing, so only people who are feeling sick get tested, which can raise the positivity rate.
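The low-volume-day distortion described above can be sketched with toy numbers (the figures below are illustrative, not from the dataset):

```r
# Toy daily counts; the third day is a "slow" reporting day
daily_tests    <- c(5000, 4800, 50, 5100)
daily_positive <- c(400, 390, 25, 410)

# Daily rate spikes on the slow day; cumulative rate barely moves
daily_rate      <- round(daily_positive / daily_tests * 100, 2)
cumulative_rate <- round(cumsum(daily_positive) / cumsum(daily_tests) * 100, 2)

daily_rate       # 8.00 8.12 50.00 8.04
cumulative_rate  # 8.00 8.06  8.27 8.19
```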
This study will not focus on this methodology, rather it will be centered around trying to predict the total number of positive cases.
# check for NA values
gg_miss_var(df2)
There are a lot of missing values. This won't matter much in the long run, as we are only concerned with a few variables.
# filter out a bunch of NAs, many states don't have data etc.
df <- df2 %>% filter(date >= "2020-04-01")
The dataset was truncated to this window. Prior to this date, much of the data for the state-level analysis was very patchy, with many missing values in the columns of interest. The NA values created issues with some of the functions in R, so the data was truncated rather than imputed.
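The truncation choice can be illustrated on toy data (hypothetical dates and counts, not the real series); filtering by date drops the NA-heavy early rows in one pass:

```r
# Toy frame: the first days mimic the patchy pre-April reporting
toy <- data.frame(
  date     = as.Date("2020-03-28") + 0:6,
  positive = c(NA, NA, NA, 10, 14, 19, 25)
)

clean <- toy[toy$date >= as.Date("2020-04-01"), ]
sum(is.na(clean$positive))  # 0: no missing values remain
nrow(clean)                 # 3 rows kept
```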
# create column for state level analysis
wa_df <- df %>% filter(df$state == 'WA')
# modify dataset for ease of use
wa_mod_df <- data.frame(date = rev(wa_df$date), positive = rev(wa_df$positive), totalTestResults = rev(wa_df$totalTestResults), death = rev(wa_df$death))
# Total Death Histogram
df22 <- filter(df, date == "2020-12-04")
df22 %>%
ggplot(color = "black")+
geom_bar(mapping=aes(x=reorder(state, death), y=death, fill= death), color='darkred', stat = 'identity', size = 0.1) +
coord_flip() +
ylab("Deaths") +
xlab('State') +
ggtitle('COVID-19 Deaths by State') +
theme_minimal() +
theme(axis.text=element_text(color = "white"), axis.text.y = element_text(size=6), legend.position = "none", plot.background = element_rect("steelblue4"), panel.background = element_rect(color="seashell3", fill='white'), panel.grid.major.y = element_line(color="ivory2"), panel.grid.major.x = element_line("steelblue4"), axis.title = element_text(color = "white"), plot.title = element_text(color = "white")) +
scale_fill_gradient2(low = "red", mid = "red3", high = "red4", midpoint = 10000)
Top 10 States by Deaths
| date | state | death | positive | totalTestResults | pr_comp |
|---|---|---|---|---|---|
| 2020-12-04 | NY | 27017 | 685364 | 20173461 | 3.4 |
| 2020-12-04 | TX | 22255 | 1228812 | 11111036 | 11.06 |
| 2020-12-04 | CA | 19582 | 1286557 | 24675478 | 5.21 |
| 2020-12-04 | FL | 19236 | 1022354 | 12857097 | 7.95 |
| 2020-12-04 | NJ | 17255 | 356662 | 6182499 | 5.77 |
| 2020-12-04 | IL | 13782 | 770088 | 10918998 | 7.05 |
| 2020-12-04 | PA | 11113 | 398600 | 3291921 | 12.11 |
| 2020-12-04 | MA | 10910 | 246398 | 8774697 | 2.81 |
| 2020-12-04 | MI | 10117 | 420268 | 6900160 | 6.09 |
| 2020-12-04 | GA | 9725 | 438300 | 4435020 | 9.88 |
# Positive Cases Histogram
df22 %>%
ggplot(color = "black")+
geom_bar(mapping=aes(x=reorder(state, positive), y=positive, fill= positive), color='darkred', stat = 'identity', size = 0.1) +
coord_flip() +
ylab("Positive Cases") +
xlab('State') +
ggtitle('Positive Case Count by State') +
theme_minimal() +
theme(axis.text=element_text(color = "white"), axis.text.y = element_text(size=6), legend.position = "none", plot.background = element_rect("steelblue4"), panel.background = element_rect(color="seashell3", fill='white'), panel.grid.major.y = element_line(color="ivory2"), panel.grid.major.x = element_line("steelblue4"), axis.title = element_text(color = "white"), plot.title = element_text(color = "white")) +
scale_fill_gradient2(low = "red", mid = "red3", high = "red4", midpoint = 125000)
Top 10 States by Positive Cases
| date | state | death | positive | totalTestResults | pr_comp |
|---|---|---|---|---|---|
| 2020-12-04 | CA | 19582 | 1286557 | 24675478 | 5.21 |
| 2020-12-04 | TX | 22255 | 1228812 | 11111036 | 11.06 |
| 2020-12-04 | FL | 19236 | 1022354 | 12857097 | 7.95 |
| 2020-12-04 | IL | 13782 | 770088 | 10918998 | 7.05 |
| 2020-12-04 | NY | 27017 | 685364 | 20173461 | 3.4 |
| 2020-12-04 | OH | 6882 | 456963 | 6349442 | 7.2 |
| 2020-12-04 | GA | 9725 | 438300 | 4435020 | 9.88 |
| 2020-12-04 | WI | 3842 | 432207 | 4556566 | 9.49 |
| 2020-12-04 | MI | 10117 | 420268 | 6900160 | 6.09 |
| 2020-12-04 | PA | 11113 | 398600 | 3291921 | 12.11 |
# Total Tests Histogram
df22 %>%
ggplot(color = "black")+
geom_bar(mapping=aes(x=reorder(state, totalTestResults), y=totalTestResults, fill= positive), color='darkred', stat = 'identity', size = 0.1) +
coord_flip() +
ylab("Total Tests") +
xlab('State') +
ggtitle('Total Test Count by State') +
theme_minimal() +
theme(axis.text=element_text(color = "white"), axis.text.y = element_text(size=6), legend.position = "none", plot.background = element_rect("steelblue4"), panel.background = element_rect(color="seashell3", fill='white'), panel.grid.major.y = element_line(color="ivory2"), panel.grid.major.x = element_line("steelblue4"), axis.title = element_text(color = "white"), plot.title = element_text(color = "white")) +
scale_fill_gradient2(low = "red", mid = "red3", high = "red4", midpoint = 125000)
Top 10 States by Total Tests
| date | state | death | positive | totalTestResults | pr_comp |
|---|---|---|---|---|---|
| 2020-12-04 | CA | 19582 | 1286557 | 24675478 | 5.21 |
| 2020-12-04 | NY | 27017 | 685364 | 20173461 | 3.4 |
| 2020-12-04 | FL | 19236 | 1022354 | 12857097 | 7.95 |
| 2020-12-04 | TX | 22255 | 1228812 | 11111036 | 11.06 |
| 2020-12-04 | IL | 13782 | 770088 | 10918998 | 7.05 |
| 2020-12-04 | MA | 10910 | 246398 | 8774697 | 2.81 |
| 2020-12-04 | MI | 10117 | 420268 | 6900160 | 6.09 |
| 2020-12-04 | OH | 6882 | 456963 | 6349442 | 7.2 |
| 2020-12-04 | NJ | 17255 | 356662 | 6182499 | 5.77 |
| 2020-12-04 | NC | 5467 | 382534 | 5485287 | 6.97 |
Top 10 States by Overall Positive Test Rate
| date | state | death | positive | totalTestResults | pr_comp |
|---|---|---|---|---|---|
| 2020-12-04 | SD | 1064 | 84398 | 336512 | 25.08 |
| 2020-12-04 | ID | 1014 | 106455 | 484192 | 21.99 |
| 2020-12-04 | KS | 1786 | 168295 | 845186 | 19.91 |
| 2020-12-04 | IA | 2605 | 210236 | 1089561 | 19.3 |
| 2020-12-04 | AL | 3831 | 264199 | 1626368 | 16.24 |
| 2020-12-04 | PR | 1173 | 54597 | 360569 | 15.14 |
| 2020-12-04 | AZ | 6885 | 352101 | 2326744 | 15.13 |
| 2020-12-04 | MS | 3916 | 161516 | 1143911 | 14.12 |
| 2020-12-04 | PA | 11113 | 398600 | 3291921 | 12.11 |
| 2020-12-04 | UT | 925 | 209170 | 1887656 | 11.08 |
plotts.wge(wa_mod_df$positive)
This realization is of the number of positive cases. It does not appear to be a stationary realization, as there does not seem to be a constant mean.
# df test for unit root
adf.test(wa_mod_df$positive)
##
## Augmented Dickey-Fuller Test
##
## data: wa_mod_df$positive
## Dickey-Fuller = 1.5918, Lag order = 6, p-value = 0.99
## alternative hypothesis: stationary
This Dickey-Fuller test shows evidence of a unit root. The data will be differenced and re-checked.
wa_d1 <- artrans.wge(wa_mod_df$positive,phi.tr = 1)
# df test for unit root
adf.test(wa_d1)
##
## Augmented Dickey-Fuller Test
##
## data: wa_d1
## Dickey-Fuller = -0.60421, Lag order = 6, p-value = 0.9766
## alternative hypothesis: stationary
There is additional evidence of a unit root; the data will be differenced again.
wa_d2 <- artrans.wge(wa_d1,phi.tr = 1)
# df test for unit root
adf.test(wa_d2)
##
## Augmented Dickey-Fuller Test
##
## data: wa_d2
## Dickey-Fuller = -11.604, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
Now that the unit roots have been removed, the data will be modeled.
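The two differencing passes above can be mimicked on synthetic data; a minimal base-R sketch (no tswge required, and the series is invented for illustration) shows why a curve with quadratic-like growth needs two differences before its mean is constant:

```r
# A quadratic trend resembles the local shape of a cumulative case curve
t <- 1:20
x <- 0.5 * t^2 + 3 * t + 10   # synthetic series, illustration only

d1 <- diff(x)   # first difference: still a linear trend (t + 3.5)
d2 <- diff(d1)  # second difference: constant, trend removed

unique(d2)      # 1: every second difference equals 1
```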
aic5.wge(wa_d2, p = 0:10, q = 0:5, type = 'aic')
## ---------WORKING... PLEASE WAIT...
##
##
## Error in aic calculation at 8 5
## Five Smallest Values of aic
## p q aic
## 46 7 3 12.29050
## 52 8 3 12.29587
## 40 6 3 12.29684
## 58 9 3 12.30413
## 64 10 3 12.31263
aic5.wge(wa_d2, p = 0:10, q = 0:5, type = 'bic')
## ---------WORKING... PLEASE WAIT...
##
##
## Error in aic calculation at 8 5
## Five Smallest Values of bic
## p q bic
## 5 0 4 12.40930
## 11 1 4 12.42376
## 6 0 5 12.43656
## 40 6 3 12.43933
## 17 2 4 12.44363
The first candidate model is an ARUMA(6,2,3); p = 6, q = 3 appears on both the AIC and BIC lists.
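As a cross-check outside tswge, base R's `arima()` can score candidate orders by AIC on a stand-in series (the simulated series and the orders below are illustrative assumptions, not the wa_d2 data):

```r
set.seed(1)
z <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 200)  # stand-in ARMA(1,1)

fit_a <- arima(z, order = c(1, 0, 1))  # matches the generating model
fit_b <- arima(z, order = c(0, 0, 1))  # deliberately underspecified

c(AIC(fit_a), AIC(fit_b))  # lower AIC indicates the preferred candidate
```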
wa_est <- est.arma.wge(wa_d2, p = 6, q = 3)
##
## Coefficients of Original polynomial:
## -0.2582 0.5921 -0.3011 -0.0389 -0.0450 -0.3582
##
## Factor Roots Abs Recip System Freq
## 1+1.8077B+0.8465B^2 -1.0678+-0.2028i 0.9200 0.4701
## 1-1.5226B+0.7767B^2 0.9802+-0.5716i 0.8813 0.0840
## 1-0.0270B+0.5449B^2 0.0248+-1.3545i 0.7382 0.2471
##
##
ljung.wge(wa_est$res)
## Obs 0.03329414 -0.01516273 0.01115798 -0.0179962 -0.005229892 0.01849734 0.04534566 0.06934604 -0.1091581 0.01865503 0.1262003 -0.06819696 0.01352833 0.07811801 0.1087244 0.02405059 0.0323192 -0.04004855 0.004045124 0.04023166 0.1424104 0.06907163 0.02104473 -0.009092386
## $test
## [1] "Ljung-Box test"
##
## $K
## [1] 24
##
## $chi.square
## [1] 23.82146
##
## $df
## [1] 24
##
## $pval
## [1] 0.4718438
Quick peek at the last 7 points for forecasting.
wa_fore <- fore.aruma.wge(wa_mod_df$positive[1:241], phi = wa_est$phi, theta = wa_est$theta, d = 2, n.ahead = 7, plot =F)
wa_arma63_ase <- mean((wa_mod_df$positive[242:248]-wa_fore$f)^2)
wa_arma63_ase
## [1] 1212613
plot(seq(1,248,1),wa_mod_df$positive, type = "b", xlim = c(240,250), ylim = c(150000, 180000))
points(seq(242,248,1),wa_fore$f,type = "b", pch = 15)
# rolling-window evaluation setup
windows <- seq(0,248,31)
trainingSize = 25
horizon = 7
df <- wa_mod_df$positive[1:248]
ASEHolder_C <- data.frame(
window=integer(),
ase=numeric(),
train_start=integer(),
train_end=integer(),
test_start=integer(),
test_end=integer()
)
phis_C = wa_est$phi
thetas_C = wa_est$theta
d_C = 2
i = 2
for (i in 2:9) {
srt <- windows[i] - 30
forecasts <- fore.aruma.wge(
df[1:(windows[i] - 7)],
phi = phis_C,
theta = thetas_C,
d = d_C,
n.ahead = horizon,
plot = F
)
print(windows[i]-horizon)
print(df[((windows[i]-horizon) + 1):windows[i]])
print(forecasts$f)
ASE <- mean((df[((windows[i]-horizon) + 1):windows[i]] - forecasts$f)^2)
ASEHolder_C[nrow(ASEHolder_C) + 1,] <- c(i-1,
ASE ,
srt,
windows[i] - horizon,
(windows[i]-horizon) + 1,
windows[i]
)
}
## [1] 24
## [1] 14266 14379 14748 15043 15371 15619 15936
## [1] 14360.51 14629.15 14883.70 15166.43 15434.39 15703.57 15987.45
## [1] 55
## [1] 21634 21985 22285 22630 22778 22878 23288
## [1] 21533.51 21792.56 22037.86 22278.26 22532.62 22762.23 22957.04
## [1] 86
## [1] 33803 34292 34556 35540 36442 37320 38189
## [1] 33721.06 34274.54 34835.03 35474.37 36077.16 36744.22 37392.02
## [1] 117
## [1] 56746 57655 58530 59263 59950 60360 60588
## [1] 56552.03 57334.89 58070.87 58731.54 59483.57 60218.42 60872.06
## [1] 148
## [1] 72161 72703 73301 74320 74320 74635 74939
## [1] 72088.44 72384.04 72565.26 72629.82 72676.02 72670.95 72583.04
## [1] 179
## [1] 85830 86269 86638 87042 87522 88116 88810
## [1] 85520.57 85930.19 86368.33 86874.84 87342.93 87717.68 88315.29
## [1] 210
## [1] 104027 104743 105557 106573 107501 108315 109354
## [1] 104051.4 104840.8 105286.3 105752.1 106474.0 107027.4 107785.3
## [1] 241
## [1] 158167 160634 162700 165019 167216 170342 172437
## [1] 156791.5 159352.3 161483.8 163805.6 166152.5 169651.8 171808.9
meanWindowedASE_C <- mean(ASEHolder_C$ase)
medWindowedASE_C <- median(ASEHolder_C$ase)
ASEHolder_C
## window ase train_start train_end test_start test_end
## 1 1 16996.69 1 24 25 31
## 2 2 59297.16 32 55 56 62
## 3 3 169855.91 63 86 87 93
## 4 4 135947.34 94 117 118 124
## 5 5 2230817.19 125 148 149 155
## 6 6 106666.26 156 179 180 186
## 7 7 847278.67 187 210 211 217
## 8 8 1212613.03 218 241 242 248
ASEHolder_C %>%
ggplot() +
geom_histogram(
aes(x = window, y = ase, fill = as.factor(window)),
stat = "identity",
show.legend = F
) +
theme_minimal() +
scale_x_discrete(name ="Window",
limits=c("1","2","3","4","5", "6", "7", "8")) +
ggtitle("ASE values w/ 8 rolling windows") +
geom_text(mapping = aes(x = 2, y = 1500000), label = paste("Mean ASE Model A: ", round(meanWindowedASE_C,0)), nudge_x = 0.5) +
geom_text(mapping = aes(x = 2, y = 1700000), label = paste("Mean RMSE Model A: ", round(sqrt(meanWindowedASE_C),0)), nudge_x = 0.55)
The above chart shows the ASE result for each window in the rolling-window evaluation. The mean is displayed on the chart as well.
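The loop above can be generalized into a reusable helper; `rolling_ase` is a hypothetical name, and the naive forecaster below stands in for `fore.aruma.wge` so the sketch stays self-contained:

```r
# Score any forecast function over a set of fixed window endpoints
rolling_ase <- function(series, fc_fun, window_ends, horizon = 7) {
  sapply(window_ends, function(end) {
    train <- series[1:(end - horizon)]
    test  <- series[(end - horizon + 1):end]
    mean((test - fc_fun(train, horizon))^2)
  })
}

# Naive "repeat the last value" forecaster on a deterministic toy series
naive_fc <- function(train, h) rep(tail(train, 1), h)
x <- (1:100)^2
res <- rolling_ase(x, naive_fc, window_ends = c(50, 75, 100))
```

Any model-specific forecaster with the same `(train, h)` signature can be dropped in without rewriting the loop.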
This is the second ARUMA candidate.
wa_est2 <- est.arma.wge(wa_d2, p = 0, q = 4)
ljung.wge(wa_est2$res)
## Obs -0.02985844 0.01212079 0.03116676 0.09150417 -0.10852 -0.1006001 0.07738768 0.06746096 -0.03185799 -0.01949065 0.1946641 -0.06446766 0.005557153 0.01020312 0.09415268 -0.06839457 0.03369336 -0.03971164 0.01945628 0.04345121 0.1064747 0.05680022 0.04461302 0.009819108
## $test
## [1] "Ljung-Box test"
##
## $K
## [1] 24
##
## $chi.square
## [1] 31.62015
##
## $df
## [1] 24
##
## $pval
## [1] 0.136695
wa_fore2 <- fore.aruma.wge(wa_mod_df$positive[1:241], phi = wa_est2$phi, theta = wa_est2$theta, d = 2, n.ahead = 7, plot = F)
wa_arma04_ase <- mean((wa_mod_df$positive[242:248]-wa_fore2$f)^2)
wa_arma04_ase
## [1] 1503970
# rolling-window evaluation setup
windows <- seq(0,248,31)
trainingSize = 25
horizon = 7
df <- wa_mod_df$positive[1:248]
ASEHolder_D <- data.frame(
window=integer(),
ase=numeric(),
train_start=integer(),
train_end=integer(),
test_start=integer(),
test_end=integer()
)
phis_D = wa_est2$phi
thetas_D = wa_est2$theta
d_D = 2
i = 2
for (i in 2:9) {
srt <- windows[i] - 30
forecasts <- fore.aruma.wge(
df[1:(windows[i] - 7)],
phi = phis_D,
theta = thetas_D,
d = d_D,
n.ahead = horizon,
plot = F
)
print(windows[i]-horizon)
print(df[((windows[i]-horizon) + 1):windows[i]])
print(forecasts$f)
ASE <- mean((df[((windows[i]-horizon) + 1):windows[i]] - forecasts$f)^2)
ASEHolder_D[nrow(ASEHolder_D) + 1,] <- c(i-1,
ASE ,
srt,
windows[i] - horizon,
(windows[i]-horizon) + 1,
windows[i]
)
}
## [1] 24
## [1] 14266 14379 14748 15043 15371 15619 15936
## [1] 14321.36 14535.82 14751.98 15004.96 15257.94 15510.92 15763.90
## [1] 55
## [1] 21634 21985 22285 22630 22778 22878 23288
## [1] 21531.59 21815.63 22072.37 22295.73 22519.09 22742.44 22965.80
## [1] 86
## [1] 33803 34292 34556 35540 36442 37320 38189
## [1] 33630.76 34121.49 34646.34 35267.28 35888.23 36509.17 37130.11
## [1] 117
## [1] 56746 57655 58530 59263 59950 60360 60588
## [1] 56650.97 57561.21 58434.11 59140.93 59847.75 60554.57 61261.39
## [1] 148
## [1] 72161 72703 73301 74320 74320 74635 74939
## [1] 72128.58 72436.41 72624.22 72652.61 72680.99 72709.38 72737.77
## [1] 179
## [1] 85830 86269 86638 87042 87522 88116 88810
## [1] 85462.50 85763.88 86083.33 86613.79 87144.25 87674.71 88205.17
## [1] 210
## [1] 104027 104743 105557 106573 107501 108315 109354
## [1] 104022.1 104906.2 105593.0 106267.7 106942.4 107617.1 108291.8
## [1] 241
## [1] 158167 160634 162700 165019 167216 170342 172437
## [1] 156153.4 158738.3 161661.3 164245.6 166829.8 169414.1 171998.3
meanWindowedASE_D <- mean(ASEHolder_D$ase)
medWindowedASE_D <- median(ASEHolder_D$ase)
ASEHolder_D
## window ase train_start train_end test_start test_end
## 1 1 11886.11 1 24 25 31
## 2 2 55049.47 32 55 56 62
## 3 3 318089.88 63 86 87 93
## 4 4 77670.08 94 117 118 124
## 5 5 2078589.14 125 148 149 155
## 6 6 226352.88 156 179 180 186
## 7 7 292664.28 187 210 211 217
## 8 8 1503970.02 218 241 242 248
ASEHolder_D %>%
ggplot() +
geom_histogram(
aes(x = window, y = ase, fill = as.factor(window)),
stat = "identity",
show.legend = F
) +
theme_minimal() +
scale_x_discrete(name ="Window",
limits=c("1","2","3","4","5", "6", "7", "8")) +
ggtitle("ASE values w/ 8 rolling windows") +
geom_text(mapping = aes(x = 2, y = 1500000), label = paste("Mean ASE Model A: ", round(meanWindowedASE_D,0)), nudge_x = 0.5) +
geom_text(mapping = aes(x = 2, y = 1700000), label = paste("Mean RMSE Model A: ", round(sqrt(meanWindowedASE_D),0)), nudge_x = 0.55)
This second candidate, the ARUMA(0,2,4), performs best on both the short 7-day forecast check and the rolling windows. It will be used as the final AR/MA-type model.
wa_d2_fore_final_90 <- fore.aruma.wge(wa_mod_df$positive, phi = wa_est2$phi, theta = wa_est2$theta, d=2, n.ahead = 90, plot = F)
wa_plot_df <- data.frame(date = wa_mod_df$date, positive = wa_mod_df$positive, ul = wa_mod_df$positive, ll = wa_mod_df$positive)
wa_plot_df <- rbind(wa_plot_df, data.frame(date = as.Date(future_dates$date, format = "%m/%d/%y"), positive = wa_d2_fore_final_90$f, ul = wa_d2_fore_final_90$ul, ll = wa_d2_fore_final_90$ll))
wa_plot_df[247:255,] %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
geom_ribbon(aes(ymin = ll, ymax = ul), fill = "red", alpha = 0.1) +
theme_minimal() +
labs(title = "WA Total Positive Cases - 7-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 day", date_labels = "%b %d") +
scale_y_continuous(labels = scales::comma)
The chart above shows the forecast for the next 7 days. The red shaded region depicts the confidence interval of the predictions.
wa_plot_df %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
geom_ribbon(aes(ymin = ll, ymax = ul), fill = "red", alpha = 0.1) +
theme_minimal() +
labs(title = "WA Total Positive Cases - 90-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 month", date_labels = "%b - %y") +
scale_y_continuous(breaks = c(seq(0,800000, 100000)), labels = scales::comma) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
This is the forecast for the next 90 days; as above, the confidence intervals are displayed in red.
wa_ts = ts(wa_mod_df$positive[1:241], frequency = 1)
This is the first candidate multilayer perceptron (MLP) model for the state analysis.
set.seed(2)
wa_mlp <- mlp(wa_ts, comb = c('mean'), sel.lag = TRUE)
wa_mlp_f <- forecast(wa_mlp, h = 7)
plot(seq(1,248,1),wa_mod_df$positive, type = "b", xlim = c(240,250), ylim = c(150000, 200000))
points(seq(242,248,1),wa_mlp_f$mean,type = "b", pch = 15)
wa_mlp_ase <- mean((wa_mod_df$positive[242:248]-wa_mlp_f$mean)^2)
wa_mlp_ase
## [1] 311039839
From the quick check, it looks like the MLP above is overestimating the forecast. Proceed to the rolling window.
# rolling-window evaluation setup
windows <- seq(0,248,31)
trainingSize = 25
horizon = 7
df <- wa_mod_df$positive[1:248]
ASEHolder_mlp <- data.frame(
window=integer(),
ase=numeric(),
train_start=integer(),
train_end=integer(),
test_start=integer(),
test_end=integer()
)
i = 2
set.seed(2)
for (i in 2:9) {
srt <- windows[i] - 30
wa_ts_tmp = ts(wa_mod_df$positive[1:(windows[i]-horizon)], frequency = 1)
wa_mlp_tmp <- mlp(wa_ts_tmp, comb = c('mean'), sel.lag = TRUE)
wa_mlp_f <- forecast(wa_mlp_tmp, h = 7)
forecasts <- forecast(wa_mlp_tmp, h = 7)
print(windows[i]-horizon)
print(df[((windows[i]-horizon) + 1):windows[i]])
print(as.numeric(forecasts$mean))
ASE <- mean((df[((windows[i]-horizon) + 1):windows[i]] - forecasts$mean)^2)
ASEHolder_mlp[nrow(ASEHolder_mlp) + 1,] <- c(i-1,
ASE ,
srt,
windows[i] - horizon,
(windows[i]-horizon) + 1,
windows[i]
)
}
## [1] 24
## [1] 14266 14379 14748 15043 15371 15619 15936
## [1] 14401.30 14693.58 14983.03 15270.03 15554.14 15840.38 16123.79
## [1] 55
## [1] 21634 21985 22285 22630 22778 22878 23288
## [1] 21587.24 21887.66 22162.53 22436.09 22716.46 22963.21 23194.88
## [1] 86
## [1] 33803 34292 34556 35540 36442 37320 38189
## [1] 33813.46 34422.50 35011.86 35580.41 36124.57 36637.96 37111.47
## [1] 117
## [1] 56746 57655 58530 59263 59950 60360 60588
## [1] 56710.43 57650.17 58554.69 59203.79 59702.37 60332.51 61088.14
## [1] 148
## [1] 72161 72703 73301 74320 74320 74635 74939
## [1] 71733.98 71771.25 71701.46 71460.64 71111.88 70752.41 70414.79
## [1] 179
## [1] 85830 86269 86638 87042 87522 88116 88810
## [1] 85943.97 86552.83 87059.21 87487.38 87882.95 88346.53 88776.87
## [1] 210
## [1] 104027 104743 105557 106573 107501 108315 109354
## [1] 104021.2 104530.7 105015.3 105518.6 106008.3 106511.2 106983.5
## [1] 241
## [1] 158167 160634 162700 165019 167216 170342 172437
## [1] 163469.7 167431.2 174682.7 176473.7 187308.7 191593.5 199258.3
meanWindowedASE_mlp <- mean(ASEHolder_mlp$ase)
medWindowedASE_mlp <- median(ASEHolder_mlp$ase)
ASEHolder_mlp
## window ase train_start train_end test_start test_end
## 1 1 48837.35 1 24 25 31
## 2 2 11997.16 32 55 56 62
## 3 3 279084.50 63 86 87 93
## 4 4 45373.70 94 117 118 124
## 5 5 8231421.33 125 148 149 155
## 6 6 93408.40 156 179 180 186
## 7 7 1793057.26 187 210 211 217
## 8 8 274834385.08 218 241 242 248
ASEHolder_mlp %>% ggplot() +
geom_histogram(
aes(x = window, y = ase, fill = as.factor(window)),
stat = "identity",
show.legend = F
) +
theme_minimal() +
scale_x_discrete(name ="Window",
limits=c("1","2","3","4","5", "6","7","8")) +
ggtitle("ASE values w/ 8 rolling windows") +
geom_text(mapping = aes(x = 2, y = 30000000), label = paste("Mean ASE Model A: ", round(meanWindowedASE_mlp,0)), nudge_x = 0.5) +
geom_text(mapping = aes(x = 2, y = 40000000), label = paste("Mean RMSE Model A: ", round(sqrt(meanWindowedASE_mlp),0)), nudge_x = 0.55)
This model performs very poorly in the last of the rolling windows, or at least it appears to.
Second candidate MLP for state analysis.
set.seed(2)
wa_mlp2 <- mlp(wa_ts, sel.lag = TRUE, allow.det.season = TRUE, hd.max = 'valid')
wa_mlp_f2 <- forecast(wa_mlp2, h = 7)
wa_mlp_ase <- mean((wa_mod_df$positive[242:248]-wa_mlp_f2$mean)^2)
wa_mlp_ase
## [1] 165622982
plot(seq(1,248,1),wa_mod_df$positive, type = "b", xlim = c(240,250), ylim = c(150000, 200000))
points(seq(242,248,1),wa_mlp_f2$mean,type = "b", pch = 15)
This does a little better on the quick peek window.
# rolling-window evaluation setup
windows <- seq(0,248,31)
trainingSize = 25
horizon = 7
df <- wa_mod_df$positive[1:248]
ASEHolder_mlp2 <- data.frame(
window=integer(),
ase=numeric(),
train_start=integer(),
train_end=integer(),
test_start=integer(),
test_end=integer()
)
i = 2
set.seed(2)
for (i in 2:9) {
srt <- windows[i] - 30
wa_ts_tmp = ts(wa_mod_df$positive[1:(windows[i]-horizon)], frequency = 1)
wa_mlp2_tmp <- mlp(wa_ts_tmp, sel.lag = TRUE, allow.det.season = TRUE, hd.max = 'valid', comb = c('median'))
wa_mlp2_f <- forecast(wa_mlp2_tmp, h = 7)
forecasts <- forecast(wa_mlp2_tmp, h = 7)
print(windows[i]-horizon)
print(df[((windows[i]-horizon) + 1):windows[i]])
print(as.numeric(forecasts$mean))
ASE <- mean((df[((windows[i]-horizon) + 1):windows[i]] - forecasts$mean)^2)
ASEHolder_mlp2[nrow(ASEHolder_mlp2) + 1,] <- c(i-1,
ASE ,
srt,
windows[i] - horizon,
(windows[i]-horizon) + 1,
windows[i]
)
}
## [1] 24
## [1] 14266 14379 14748 15043 15371 15619 15936
## [1] 14404.68 14697.28 14984.62 15270.54 15554.81 15841.17 16126.96
## [1] 55
## [1] 21634 21985 22285 22630 22778 22878 23288
## [1] 21581.04 21866.18 22160.82 22422.70 22724.94 22964.95 23221.08
## [1] 86
## [1] 33803 34292 34556 35540 36442 37320 38189
## [1] 33813.69 34421.28 35011.93 35583.98 36130.30 36640.68 37105.50
## [1] 117
## [1] 56746 57655 58530 59263 59950 60360 60588
## [1] 56789.97 57765.09 58718.86 59500.24 60102.65 60590.76 61162.20
## [1] 148
## [1] 72161 72703 73301 74320 74320 74635 74939
## [1] 71743.18 71862.83 72017.34 72164.29 72307.28 72447.91 72576.97
## [1] 179
## [1] 85830 86269 86638 87042 87522 88116 88810
## [1] 85967.40 86636.63 87224.54 87625.94 88034.73 88423.99 88746.05
## [1] 210
## [1] 104027 104743 105557 106573 107501 108315 109354
## [1] 104045.5 104554.9 105029.9 105550.8 106116.2 106671.8 107169.8
## [1] 241
## [1] 158167 160634 162700 165019 167216 170342 172437
## [1] 163025.0 165907.7 170724.9 172420.3 182708.5 186570.8 192392.9
meanWindowedASE_mlp2 <- mean(ASEHolder_mlp2$ase)
medWindowedASE_mlp2 <- median(ASEHolder_mlp2$ase)
ASEHolder_mlp2
## window ase train_start train_end test_start test_end
## 1 1 49700.50 1 24 25 31
## 2 2 12881.45 32 55 56 62
## 3 3 279890.67 63 86 87 93
## 4 4 73180.26 94 117 118 124
## 5 5 3084129.71 125 148 149 155
## 6 6 171555.25 156 179 180 186
## 7 7 1535231.37 187 210 211 217
## 8 8 153174137.19 218 241 242 248
ASEHolder_mlp2 %>% ggplot() +
geom_histogram(
aes(x = window, y = ase, fill = as.factor(window)),
stat = "identity",
show.legend = F
) +
theme_minimal() +
scale_x_discrete(name ="Window",
limits=c("1","2","3","4","5", "6","7","8")) +
ggtitle("ASE values w/ 8 rolling windows") +
geom_text(mapping = aes(x = 2, y = 30000000), label = paste("Mean ASE Model A: ", round(meanWindowedASE_mlp2,0)), nudge_x = 0.5) +
geom_text(mapping = aes(x = 2, y = 40000000), label = paste("Mean RMSE Model A: ", round(sqrt(meanWindowedASE_mlp2),0)), nudge_x = 0.55)
This MLP model performs better than the first candidate. A final model and predictions will be produced with these same parameters.
set.seed(2)
wa_ts_fin = ts(wa_mod_df$positive, frequency = 1)
wa_mlp_fin <- mlp(wa_ts_fin, sel.lag = TRUE, allow.det.season = TRUE, hd.max = 'valid', comb = c('median'))
#wa_mlp_f_fin7 <- forecast(wa_mlp_fin, h = 7)
wa_mlp_f_fin90 <- forecast(wa_mlp_fin, h = 90)
wa_plot_df <- data.frame(date = wa_mod_df$date, positive = wa_mod_df$positive)
wa_plot_df <- rbind(wa_plot_df, data.frame(date = as.Date(future_dates$date, format = "%m/%d/%y"), positive = wa_mlp_f_fin90$mean))
wa_plot_df[247:255,] %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
theme_minimal() +
labs(title = "WA Total Positive Cases - 7-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 day", date_labels = "%b %d") +
scale_y_continuous(labels = scales::comma)
This plots the forecast for the next 7 days with the "best" MLP model.
wa_plot_df %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
theme_minimal() +
labs(title = "WA Total Positive Cases - 90-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 month", date_labels = "%b - %y") +
scale_y_continuous(breaks = c(seq(0,800000, 100000)), labels = scales::comma) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
This plots the forecast for the next 90 days with the "best" MLP model.
# compare against the same 242:248 test window used for the individual models
wa_ens_ase <- mean((wa_mod_df$positive[242:248]-((as.numeric(wa_mlp_f2$mean)+wa_fore$f)/2))^2)
wa_ens_ase
## [1] 2892513035
plot(seq(1,248,1),wa_mod_df$positive, type = "b", xlim = c(235,250), ylim = c(150000, 200000))
points(seq(242,248,1),(as.numeric(wa_mlp_f2$mean)+wa_fore$f)/2,type = "b", pch = 15)
wa_ens_f <- (as.numeric(wa_mlp_f_fin90$mean)+wa_d2_fore_final_90$f)/2
wa_plot_df <- data.frame(date = wa_mod_df$date, positive = wa_mod_df$positive)
wa_plot_df <- rbind(wa_plot_df, data.frame(date = as.Date(future_dates$date, format = "%m/%d/%y"), positive = wa_ens_f))
wa_plot_df %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
theme_minimal() +
labs(title = "WA Total Positive Cases - 90-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 month", date_labels = "%b - %y") +
scale_y_continuous(breaks = c(seq(0,800000, 100000)), labels = scales::comma) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
The above is an ensemble of the forecasts for the next 7/90 days. The ensemble is formed by averaging the forecasts of the ARUMA(0,2,4) and the second MLP.
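The averaging step can be written as a small helper; equal weights reproduce the simple mean used above, but the weight is an assumption and could instead be tuned on the rolling-window ASEs (the forecast vectors below are toy numbers, not model output):

```r
# Weighted average of two forecast vectors; w = 0.5 gives the simple mean
ensemble <- function(f1, f2, w = 0.5) w * f1 + (1 - w) * f2

f_aruma <- c(100, 110, 120)   # toy forecasts, for illustration only
f_mlp   <- c(104, 112, 126)
ensemble(f_aruma, f_mlp)      # 102 111 123
```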
df <- df2 %>% filter(date >= "2020-04-01")
us_agg <- aggregate(abs(df[c("positiveIncrease", "death", "totalTestResultsIncrease", "positive", "totalTestResults")]), by=df["date"], sum)
us_agg$pos_rate <- round((us_agg$positiveIncrease/us_agg$totalTestResultsIncrease) * 100, 2)
us_agg$pr_comp <- round((us_agg$positive/us_agg$totalTestResults) * 100, 2)
adf.test(us_agg$positive)
##
## Augmented Dickey-Fuller Test
##
## data: us_agg$positive
## Dickey-Fuller = 0.12686, Lag order = 6, p-value = 0.99
## alternative hypothesis: stationary
The result of the Dickey-Fuller test shows evidence of a unit root.
us_d1 <- artrans.wge(us_agg$positive,phi.tr = 1)
adf.test(us_d1)
##
## Augmented Dickey-Fuller Test
##
## data: us_agg$positive
## Dickey-Fuller = 0.12686, Lag order = 6, p-value = 0.99
## alternative hypothesis: stationary
The first-differenced data also show evidence of a unit root.
us_d2 <- artrans.wge(us_d1,phi.tr = 1)
Now that the data are reasonably stationary, modeling will begin by selecting an appropriate p and q for the ARUMA model.
aic5.wge(us_d2, p = 0:10, q = 0:5, type = "aic")
## ---------WORKING... PLEASE WAIT...
##
##
## Error in aic calculation at 3 2
## Error in aic calculation at 3 3
## Error in aic calculation at 3 4
## Error in aic calculation at 4 2
## Error in aic calculation at 4 4
## Error in aic calculation at 5 2
## Error in aic calculation at 5 3
## Error in aic calculation at 5 4
## Error in aic calculation at 5 5
## Error in aic calculation at 6 2
## Error in aic calculation at 6 3
## Error in aic calculation at 6 5
## Error in aic calculation at 7 2
## Error in aic calculation at 7 3
## Error in aic calculation at 8 2
## Error in aic calculation at 8 3
## Error in aic calculation at 8 4
## Error in aic calculation at 8 5
## Error in aic calculation at 9 2
## Error in aic calculation at 9 3
## Error in aic calculation at 9 4
## Error in aic calculation at 9 5
## Error in aic calculation at 10 2
## Error in aic calculation at 10 3
## Error in aic calculation at 10 4
## Error in aic calculation at 10 5
## Five Smallest Values of aic
## p q aic
## 48 7 5 17.77546
## 47 7 4 17.78039
## 62 10 1 17.83095
## 61 10 0 17.83810
## 41 6 4 17.86812
aic5.wge(us_d2, p = 0:10, q = 0:5, type = "bic")
## ---------WORKING... PLEASE WAIT...
##
##
## Error in aic calculation at 3 2
## Error in aic calculation at 3 3
## Error in aic calculation at 3 4
## Error in aic calculation at 4 2
## Error in aic calculation at 4 4
## Error in aic calculation at 5 2
## Error in aic calculation at 5 3
## Error in aic calculation at 5 4
## Error in aic calculation at 5 5
## Error in aic calculation at 6 2
## Error in aic calculation at 6 3
## Error in aic calculation at 6 5
## Error in aic calculation at 7 2
## Error in aic calculation at 7 3
## Error in aic calculation at 8 2
## Error in aic calculation at 8 3
## Error in aic calculation at 8 4
## Error in aic calculation at 8 5
## Error in aic calculation at 9 2
## Error in aic calculation at 9 3
## Error in aic calculation at 9 4
## Error in aic calculation at 9 5
## Error in aic calculation at 10 2
## Error in aic calculation at 10 3
## Error in aic calculation at 10 4
## Error in aic calculation at 10 5
## Five Smallest Values of bic
## p q bic
## 47 7 4 17.95138
## 48 7 5 17.96070
## 61 10 0 17.99485
## 62 10 1 18.00194
## 49 8 0 18.00575
The chosen model, ARMA(7, 4), is on both lists: it is the best by BIC and second best by AIC.
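As an aside on what these two criteria measure: both penalize the log-likelihood for model size, with BIC penalizing extra parameters more heavily, which is why it tends to favor the more parsimonious order. A minimal sketch in Python with made-up log-likelihood values (not taken from the models above):

```python
import math

def aic_bic(loglik, n_params, n_obs):
    """Return (AIC, BIC) for a fitted model.

    AIC = 2k - 2*logL; BIC = k*ln(n) - 2*logL. For n >= 8, ln(n) > 2,
    so BIC charges more per parameter and prefers smaller models.
    """
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n_obs) - 2 * loglik
    return aic, bic

# Hypothetical log-likelihoods for two candidate orders on n = 240 points.
aic_a, bic_a = aic_bic(loglik=-2100.0, n_params=12, n_obs=240)  # larger model
aic_b, bic_b = aic_bic(loglik=-2101.2, n_params=11, n_obs=240)  # smaller model
# The larger model wins narrowly on AIC but loses on BIC.
```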
us_est <- est.arma.wge(us_d2, p = 7, q = 4)
##
## Coefficients of Original polynomial:
## 0.5299 -0.6832 0.1470 -0.2604 -0.1767 0.1425 0.2807
##
## Factor Roots Abs Recip System Freq
## 1-1.2392B+0.9967B^2 0.6217+-0.7854i 0.9984 0.1434
## 1+0.3453B+0.8760B^2 -0.1971+-1.0501i 0.9359 0.2795
## 1-0.7621B 1.3122 0.7621 0.0000
## 1+1.1261B+0.4219B^2 -1.3345+-0.7676i 0.6496 0.4169
##
##
ljung.wge(us_est$res)
## Obs -0.0006893617 -0.003279563 -0.004832179 -3.203953e-05 0.005869496 0.01940347 -0.07556891 0.07977467 0.01012716 -0.0656494 0.000431848 -0.02720915 -0.0172305 0.1197227 0.01710609 0.01270574 -0.01232756 0.0205792 -0.06443041 0.04628213 0.1069666 0.03736294 0.0607039 -0.03178919
## $test
## [1] "Ljung-Box test"
##
## $K
## [1] 24
##
## $chi.square
## [1] 15.11901
##
## $df
## [1] 24
##
## $pval
## [1] 0.9172282
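The Ljung-Box statistic reported above is a simple function of the residual autocorrelations; here is a sketch in Python, with made-up autocorrelation values for illustration:

```python
def ljung_box(acf, n):
    """Ljung-Box portmanteau statistic from residual autocorrelations.

    acf: autocorrelations r_1..r_K of the residual series (lag 0 excluded).
    Q = n(n+2) * sum(r_k^2 / (n - k)); under white-noise residuals Q is
    approximately chi-square with K degrees of freedom (K - p - q when
    the residuals come from a fitted ARMA(p, q)).
    """
    return n * (n + 2) * sum(
        r * r / (n - k) for k, r in enumerate(acf, start=1)
    )

# Small made-up autocorrelations: Q stays well below the chi-square
# critical value, so white noise is not rejected, consistent with the
# large p-value reported above.
q = ljung_box([0.05, -0.03, 0.02, 0.01], n=240)
```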
us_fore <- fore.aruma.wge(us_agg$positive[1:242], phi = us_est$phi, theta = us_est$theta, d = 2, n.ahead = 7, plot = F)
us_arma74_ase <- mean((us_agg$positive[242:248] - us_fore$f)^2)
us_arma74_ase
## [1] 9647267344
plot(seq(1,248,1),us_agg$positive, type = "b", xlim = c(235,250), ylim = c(11800000, 14500000))
points(seq(242,248,1),us_fore$f,type = "b", pch = 15)
A quick peek analysis of the last 7 data points.
# Changes made to this loop
windows <- seq(0,248,31)
trainingSize = 25
horizon = 7
df <- us_agg$positive[1:248]
ASEHolder_usAr <- data.frame(
window=integer(),
ase=numeric(),
train_start=integer(),
train_end=integer(),
test_start=integer(),
test_end=integer()
)
phis_usAr = us_est$phi
thetas_usAr = us_est$theta
d_usAr = 2
i = 2
for (i in 2:9) {
srt <- windows[i] - 30
forecasts <- fore.aruma.wge(
df[1:(windows[i] - 7)],
phi = phis_usAr,
theta = thetas_usAr,
d = d_usAr,
n.ahead = horizon,
plot = F
)
print(windows[i]-horizon)
print(df[((windows[i]-horizon) + 1):windows[i]])
print(forecasts$f)
ASE <- mean((df[((windows[i]-horizon) + 1):windows[i]] - forecasts$f)^2)
ASEHolder_usAr[nrow(ASEHolder_usAr) + 1,] <- c(i-1,
ASE ,
srt,
windows[i] - horizon,
(windows[i]-horizon) + 1,
windows[i]
)
}
## [1] 24
## [1] 942523 969581 992176 1017548 1043774 1073745 1106547
## [1] 932566.1 959044.3 986657.4 1015760.6 1044130.1 1072760.2 1103434.0
## [1] 55
## [1] 1680724 1700135 1722794 1746382 1769819 1791211 1811573
## [1] 1686542 1709012 1730342 1751308 1772895 1794388 1814914
## [1] 86
## [1] 2452369 2495262 2536594 2576359 2623989 2674990 2728498
## [1] 2437324 2470900 2507091 2545573 2582273 2620063 2659240
## [1] 117
## [1] 4262807 4321791 4385732 4454538 4522068 4582313 4628823
## [1] 4275827 4342494 4406456 4467264 4531334 4594728 4658511
## [1] 148
## [1] 5814369 5860495 5904458 5943743 5975013 6017434 6047651
## [1] 5806840 5842767 5874289 5909578 5940552 5974018 6005939
## [1] 179
## [1] 7048812 7084695 7121136 7165483 7211088 7260346 7311280
## [1] 7063505 7111425 7160362 7205386 7248572 7292364 7335315
## [1] 210
## [1] 8772165 8860888 8958126 9048711 9122560 9204607 9291273
## [1] 8763679 8835011 8909203 8990158 9071229 9152029 9231423
## [1] 241
## [1] 13055778 13191020 13338607 13515360 13711156 13921360 14146191
## [1] 13067662 13240081 13393590 13557771 13724741 13879860 14046729
meanWindowedASE_usAr <- mean(ASEHolder_usAr$ase)
medWindowedASE_usAr <- median(ASEHolder_usAr$ase)
ASEHolder_usAr
## window ase train_start train_end test_start test_end
## 1 1 36371499 1 24 25 31
## 2 2 32084835 32 55 56 62
## 3 3 1741706565 63 86 87 93
## 4 4 330129843 94 117 118 124
## 5 5 1037255566 125 148 149 155
## 6 6 1009876150 156 179 180 186
## 7 7 2220706094 187 210 211 217
## 8 8 2738494782 218 241 242 248
ASEHolder_usAr %>%
ggplot() +
geom_histogram(
aes(x = window, y = ase, fill = as.factor(window)),
stat = "identity",
show.legend = F
) +
theme_minimal() +
scale_x_discrete(name ="Window",
limits=c("1","2","3","4","5", "6", "7", "8")) +
ggtitle("ASE values w/ 8 rolling windows") +
geom_text(mapping = aes(x = 2, y = 2500000000), label = paste("Mean ASE Model A: ", round(meanWindowedASE_usAr,0)), nudge_x = 0.5) +
geom_text(mapping = aes(x = 2, y = 2600000000), label = paste("Mean RMSE Model A: ", round(sqrt(meanWindowedASE_usAr),0)), nudge_x = 0.55)
The above are the results of the rolling window ASE analysis. They will be compared to the MLP model in the next section to select the final model.
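The structure of the rolling window loop used throughout this report can be sketched language-agnostically; here is a minimal Python version with a hypothetical last-value ("naive") forecaster standing in for the fitted models:

```python
def rolling_window_ase(series, forecast_fn, window_ends, horizon):
    """Average squared error over several hold-out windows.

    For each window end t, forecast from series[:t - horizon] and score
    the forecast against the held-out series[t - horizon : t]. Returns
    one ASE per window; the mean or median summarizes them.
    """
    ases = []
    for t in window_ends:
        train, test = series[: t - horizon], series[t - horizon : t]
        preds = forecast_fn(train, horizon)
        ases.append(
            sum((a - f) ** 2 for a, f in zip(test, preds)) / horizon
        )
    return ases

# Hypothetical forecaster: repeat the last observed value.
naive = lambda train, h: [train[-1]] * h
series = list(range(100))  # a deterministic ramp, for illustration only
ases = rolling_window_ase(series, naive, window_ends=[31, 62, 93], horizon=7)
```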
us_ts = ts(us_agg$positive[1:241], frequency = 1)
set.seed(2)
us_mlp <- mlp(us_ts, sel.lag = TRUE, allow.det.season = TRUE, hd.max = 'valid', comb = c('median'))
us_mlp_f <- forecast(us_mlp, h = 7)
us_mlp_ase <- mean((us_agg$positive[242:248]-us_mlp_f$mean)^2)
us_mlp_ase
## [1] 2416322597
plot(seq(1,248,1),us_agg$positive, type = "b", xlim = c(235,250), ylim = c(11800000, 14500000))
points(seq(242,248,1),us_mlp_f$mean,type = "b", pch = 15)
The quick peek favors the MLP over the ARUMA(7,2,4); the rolling window should give a better idea of which would be more useful for this analysis.
# Changes made to this loop
windows <- seq(0,248,31)
trainingSize = 25
horizon = 7
df <- us_agg$positive[1:248]
ASEHolder_mlpN <- data.frame(
window=integer(),
ase=numeric(),
train_start=integer(),
train_end=integer(),
test_start=integer(),
test_end=integer()
)
i = 2
set.seed(2)
for (i in 2:9) {
srt <- windows[i] - 30
us_ts_tmp = ts(us_agg$positive[1:(windows[i]-horizon)], frequency = 1)
us_mlpN_tmp <- mlp(us_ts_tmp, sel.lag = TRUE, allow.det.season = TRUE, hd.max = 'valid', comb = c('median'))
forecasts <- forecast(us_mlpN_tmp, h = horizon)
print(windows[i]-horizon)
print(df[((windows[i]-horizon) + 1):windows[i]])
print(as.numeric(forecasts$mean))
ASE <- mean((df[((windows[i]-horizon) + 1):windows[i]] - forecasts$mean)^2)
ASEHolder_mlpN[nrow(ASEHolder_mlpN) + 1,] <- c(i-1,
ASE ,
srt,
windows[i] - horizon,
(windows[i]-horizon) + 1,
windows[i]
)
}
## [1] 24
## [1] 942523 969581 992176 1017548 1043774 1073745 1106547
## [1] 936420.3 964006.4 990997.3 1017001.5 1048671.2 1079606.9 1112987.9
## [1] 55
## [1] 1680724 1700135 1722794 1746382 1769819 1791211 1811573
## [1] 1686100 1707944 1731783 1756639 1784765 1814241 1844562
## [1] 86
## [1] 2452369 2495262 2536594 2576359 2623989 2674990 2728498
## [1] 2441502 2467565 2490415 2514536 2542698 2570023 2600459
## [1] 117
## [1] 4262807 4321791 4385732 4454538 4522068 4582313 4628823
## [1] 4266149 4328508 4393519 4460292 4525805 4590123 4654900
## [1] 148
## [1] 5814369 5860495 5904458 5943743 5975013 6017434 6047651
## [1] 5814358 5858472 5902796 5945880 5987028 6027303 6068003
## [1] 179
## [1] 7048812 7084695 7121136 7165483 7211088 7260346 7311280
## [1] 7058688 7102678 7149340 7196382 7243293 7289871 7336380
## [1] 210
## [1] 8772165 8860888 8958126 9048711 9122560 9204607 9291273
## [1] 8768521 8838915 8903517 8966267 9031583 9097294 9161082
## [1] 241
## [1] 13055778 13191020 13338607 13515360 13711156 13921360 14146191
## [1] 13071671 13241763 13405013 13573212 13739536 13906004 14070686
meanWindowedASE_mlpN <- mean(ASEHolder_mlpN$ase)
medWindowedASE_mlpN <- median(ASEHolder_mlpN$ase)
ASEHolder_mlpN
## window ase train_start train_end test_start test_end
## 1 1 24262406 1 24 25 31
## 2 2 302557876 32 55 56 62
## 3 3 5837170442 63 86 87 93
## 4 4 129288405 94 117 118 124
## 5 5 95339648 125 148 149 155
## 6 6 672849564 156 179 180 186
## 7 7 6716846123 187 210 211 217
## 8 8 2475174149 218 241 242 248
ASEHolder_mlpN %>% ggplot() +
geom_histogram(
aes(x = window, y = ase, fill = as.factor(window)),
stat = "identity",
show.legend = F
) +
theme_minimal() +
scale_x_discrete(name ="Window",
limits=c("1","2","3","4","5", "6","7","8")) +
ggtitle("ASE values w/ 8 rolling windows") +
geom_text(mapping = aes(x = 2, y = 6500000000), label = paste("Mean ASE Model A: ", round(meanWindowedASE_mlpN,0)), nudge_x = 0.5) +
geom_text(mapping = aes(x = 2, y = 7000000000), label = paste("Mean RMSE Model A: ", round(sqrt(meanWindowedASE_mlpN),0)), nudge_x = 0.55)
The rolling window analysis for the MLP model shows that it may not be as useful for this analysis. The ARUMA model will be used for the final predictions.
us_ens_ase <- mean((us_agg$positive[242:248]-((as.numeric(us_mlp_f$mean)+us_fore$f)/2))^2)
us_ens_ase
## [1] 5034228905
plot(seq(1,248,1),us_agg$positive, type = "b", xlim = c(235,250), ylim = c(11800000, 14500000))
points(seq(242,248,1),(as.numeric(us_mlp_f$mean)+us_fore$f)/2,type = "b", pch = 15)
The ensemble model does better than the ARUMA on the quick peek, but so did the MLP alone. Even so, the ARUMA model will be used for the final predictions.
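The ensemble here is just the pointwise mean of the two forecast paths; a toy sketch in Python (the numbers are invented, to show how averaging can cancel opposite biases):

```python
def ensemble_mean(f1, f2):
    """Pointwise average of two forecast paths.

    An equal-weight ensemble; weights could instead be set inversely
    proportional to each model's rolling window ASE.
    """
    return [(a + b) / 2 for a, b in zip(f1, f2)]

def ase(actual, forecast):
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

# Toy numbers: one model overshoots, the other undershoots, and the
# ensemble lands between them.
actual = [100.0, 110.0, 120.0]
model_a = [104.0, 114.0, 124.0]  # biased high
model_b = [98.0, 108.0, 118.0]   # biased low
combo = ensemble_mean(model_a, model_b)
```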
us_d2_fore_final_90 <- fore.aruma.wge(us_agg$positive, phi = us_est$phi, theta = us_est$theta, d=2, n.ahead = 90, plot = F)
us_plot_df <- data.frame(date = us_agg$date, positive = us_agg$positive, ul = us_agg$positive, ll = us_agg$positive)
us_plot_df <- rbind(us_plot_df, data.frame(date = as.Date(future_dates$date, format = "%m/%d/%y"), positive = us_d2_fore_final_90$f, ul = us_d2_fore_final_90$ul, ll = us_d2_fore_final_90$ll))
us_plot_df[247:255,] %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
geom_ribbon(aes(ymin = ll, ymax = ul), fill = "blue", alpha = 0.1) +
theme_minimal() +
labs(title = "US Total Positive Cases - 7-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 day", date_labels = "%b %d") +
scale_y_continuous(labels = scales::comma)
The above plot shows the 7-day forecast of total national positive cases with the ARUMA(7,2,4) model.
us_plot_df %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
geom_ribbon(aes(ymin = ll, ymax = ul), fill = "blue", alpha = 0.1) +
theme_minimal() +
labs(title = "US Total Positive Cases - 90-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 month", date_labels = "%b - %y") +
scale_y_continuous(labels = scales::comma) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
The above plot shows the 90-day forecast of total national positive cases with the ARUMA(7,2,4) model. The series continues to increase, though a slower growth rate may start to emerge in a few weeks.
# Create variables for VAR selections
X <- cbind(rev(wa_df$positive)[1:241],rev(wa_df$death)[1:241], rev(wa_df$totalTestResults)[1:241])
wa_var_slct <- VARselect(X, lag.max = 15, type = "both")
wa_var_slct
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 12 12 1 12
##
## $criteria
## 1 2 3 4 5
## AIC(n) 3.468472e+01 3.468085e+01 3.465589e+01 3.464363e+01 3.452441e+01
## HQ(n) 3.477633e+01 3.482744e+01 3.485745e+01 3.490017e+01 3.483591e+01
## SC(n) 3.491174e+01 3.504409e+01 3.515534e+01 3.527931e+01 3.529630e+01
## FPE(n) 1.157151e+15 1.152764e+15 1.124502e+15 1.111083e+15 9.865777e+14
## 6 7 8 9 10
## AIC(n) 3.450361e+01 3.444952e+01 3.440429e+01 3.435521e+01 3.438055e+01
## HQ(n) 3.487009e+01 3.487097e+01 3.488071e+01 3.488660e+01 3.496691e+01
## SC(n) 3.541172e+01 3.549385e+01 3.558483e+01 3.567196e+01 3.583352e+01
## FPE(n) 9.667924e+14 9.165554e+14 8.768518e+14 8.358558e+14 8.585755e+14
## 11 12 13 14 15
## AIC(n) 3.438481e+01 3.398968e+01 3.403173e+01 3.406262e+01 3.409564e+01
## HQ(n) 3.502614e+01 3.468598e+01 3.478300e+01 3.486886e+01 3.495685e+01
## SC(n) 3.597400e+01 3.571508e+01 3.589334e+01 3.606045e+01 3.622969e+01
## FPE(n) 8.637890e+14 5.830822e+14 6.096515e+14 6.306179e+14 6.539873e+14
The above output shows which VAR order would be favorable. The BIC statistic (SC in the output above) was used for this analysis, so p was set equal to 1.
# Quick peek analysis
wa_var <- VAR(X,p=1, type = "both", lag.max = 15)
## Warning in VAR(X, p = 1, type = "both", lag.max = 15): No column names supplied in y, using: y1, y2, y3 , instead.
wa_var_preds <- predict(wa_var,n.ahead = 7)
wa_var_ase <- mean((wa_mod_df$positive[242:248] - wa_var_preds$fcst$y1[,1])^2)
wa_var_ase
## [1] 7838265
plot(seq(1,248,1),wa_mod_df$positive, type = "b", xlim = c(235,250), ylim = c(150000, 180000))
points(seq(242,248,1),wa_var_preds$fcst$y1[,1],type = "b", pch = 15)
# Changes made to this loop
windows <- seq(0,248,31)
trainingSize = 25
horizon = 7
df <- wa_mod_df$positive[1:248]
ASEHolder_waVar <- data.frame(
window=integer(),
ase=numeric(),
train_start=integer(),
train_end=integer(),
test_start=integer(),
test_end=integer()
)
i = 2
for (i in 2:9) {
srt <- windows[i] - 30
X_tmp <- cbind(wa_mod_df$positive[1:(windows[i]-horizon)], wa_mod_df$death[1:(windows[i]-horizon)], wa_mod_df$totalTestResults[1:(windows[i]-horizon)])
wa_var_tmp <- VAR(X_tmp,p=12, type = "both", lag.max = 15)
wa_var_preds_tmp <- predict(wa_var_tmp,n.ahead = 7)
print(windows[i]-horizon)
print(df[((windows[i]-horizon) + 1):windows[i]])
print(wa_var_preds_tmp$fcst$y1[,1])
ASE <- mean((df[((windows[i]-horizon) + 1):windows[i]] - wa_var_preds_tmp$fcst$y1[,1])^2)
ASEHolder_waVar[nrow(ASEHolder_waVar) + 1,] <- c(i-1,
ASE ,
srt,
windows[i] - horizon,
(windows[i]-horizon) + 1,
windows[i]
)
}
## [1] 24
## [1] 14266 14379 14748 15043 15371 15619 15936
## [1] 14328.89 14495.58 14657.97 14859.67 15092.20 15318.73 15507.55
## [1] 55
## [1] 21634 21985 22285 22630 22778 22878 23288
## [1] 21290.68 21225.15 21387.62 21248.54 20681.96 20318.66 20526.16
## [1] 86
## [1] 33803 34292 34556 35540 36442 37320 38189
## [1] 33823.99 34239.32 34571.60 35403.89 36157.39 36980.97 37822.43
## [1] 117
## [1] 56746 57655 58530 59263 59950 60360 60588
## [1] 56856.33 57978.97 58947.08 59813.01 60753.45 61331.74 61592.58
## [1] 148
## [1] 72161 72703 73301 74320 74320 74635 74939
## [1] 71691.78 71665.74 71381.88 70912.35 70369.37 69778.82 69192.40
## [1] 179
## [1] 85830 86269 86638 87042 87522 88116 88810
## [1] 85666.12 86425.44 86790.51 87461.70 87950.31 88269.20 89069.94
## [1] 210
## [1] 104027 104743 105557 106573 107501 108315 109354
## [1] 103977.5 104689.4 105508.0 106231.0 107397.1 107847.5 108500.8
## [1] 241
## [1] 158167 160634 162700 165019 167216 170342 172437
## [1] 156322.3 157214.4 159680.5 165212.7 171968.3 172858.3 173741.1
meanWindowedASE_waVar <- mean(ASEHolder_waVar$ase)
medWindowedASE_waVar <- median(ASEHolder_waVar$ase)
ASEHolder_waVar
## window ase train_start train_end test_start test_end
## 1 1 58674.80 1 24 25 31
## 2 2 3140047.31 32 55 56 62
## 3 3 50329.09 63 86 87 93
## 4 4 456084.27 94 117 118 124
## 5 5 12686365.75 125 148 149 155
## 6 6 75030.95 156 179 180 186
## 7 7 154560.01 187 210 211 217
## 8 8 7838264.66 218 241 242 248
ASEHolder_waVar %>% ggplot() +
geom_histogram(
aes(x = window, y = ase, fill = as.factor(window)),
stat = "identity",
show.legend = F
) +
theme_minimal() +
scale_x_discrete(name ="Window",
limits=c("1","2","3","4","5", "6","7","8")) +
ggtitle("ASE values w/ 8 rolling windows") +
geom_text(mapping = aes(x = 2, y = 5000000), label = paste("Mean ASE Model A: ", round(meanWindowedASE_waVar,0)), nudge_x = 0.5) +
geom_text(mapping = aes(x = 2, y = 6000000), label = paste("Mean RMSE Model A: ", round(sqrt(meanWindowedASE_waVar),0)), nudge_x = 0.55)
The above plot shows the results of the rolling window ASE analysis for the VAR model. These will be compared to the MLP produced in the next section, and the model with the better ASE will be chosen.
wa_xreg <- data.frame(death=ts(wa_mod_df$death), totalTestResults = ts(wa_mod_df$totalTestResults))
set.seed(2)
wa_mlp_mv <- mlp(wa_ts, xreg = wa_xreg, sel.lag = TRUE, allow.det.season = TRUE, hd.max = 'valid', comb = c('median'))
wa_mlp_mv_f <- forecast(wa_mlp_mv, h = 7, xreg = wa_xreg)
wa_mlp_mv_ase <- mean((wa_mod_df$positive[242:248]-wa_mlp_mv_f$mean)^2)
wa_mlp_mv_ase
## [1] 211789121
plot(seq(1,248,1),wa_mod_df$positive, type = "b", xlim = c(235,250), ylim = c(150000, 200000))
points(seq(242,248,1),wa_mlp_mv_f$mean,type = "b", pch = 15)
The quick peek shows that this model overestimates the held-out values.
# Changes made to this loop
windows <- seq(0,248,31)
trainingSize = 25
horizon = 7
df <- wa_mod_df$positive[1:248]
ASEHolder_mlpWAmv <- data.frame(
window=integer(),
ase=numeric(),
train_start=integer(),
train_end=integer(),
test_start=integer(),
test_end=integer()
)
i = 2
set.seed(2)
for (i in 2:9) {
srt <- windows[i] - 30
wa_ts_tmp = ts(wa_mod_df$positive[1:(windows[i]-horizon)], frequency = 1)
wa_xreg_tmp <- data.frame(death=ts(wa_mod_df$death[1:(windows[i])]), totalTestResults = ts(wa_mod_df$totalTestResults[1:(windows[i])]))
wa_mlpWAmv_tmp <- mlp(wa_ts_tmp, xreg = wa_xreg_tmp, sel.lag = TRUE, allow.det.season = TRUE, hd.max = 'valid', comb = c('median'))
wa_mlpWAmv_f <- forecast(wa_mlpWAmv_tmp, h = 7, xreg = wa_xreg_tmp)
# nam <- paste0("forecasts", i)
# assign(nam, forecast(wa_mlpWAmv_tmp, h = horizon, xreg = wa_xreg))
print(windows[i]-horizon)
print(df[((windows[i]-horizon) + 1):windows[i]])
print(as.numeric(wa_mlpWAmv_f$mean))
ASE <- mean((df[((windows[i]-horizon) + 1):windows[i]] - wa_mlpWAmv_f$mean)^2)
ASEHolder_mlpWAmv[nrow(ASEHolder_mlpWAmv) + 1,] <- c(i-1,
ASE ,
srt,
windows[i] - horizon,
(windows[i]-horizon) + 1,
windows[i]
)
}
## [1] 24
## [1] 14266 14379 14748 15043 15371 15619 15936
## [1] 14368.42 14603.03 14879.69 15091.12 15386.04 15776.39 16025.84
## [1] 55
## [1] 21634 21985 22285 22630 22778 22878 23288
## [1] 21522.00 21795.72 22108.87 22291.15 22536.52 22751.92 23050.59
## [1] 86
## [1] 33803 34292 34556 35540 36442 37320 38189
## [1] 33794.72 34350.68 35096.79 35881.43 36470.62 36953.73 37751.70
## [1] 117
## [1] 56746 57655 58530 59263 59950 60360 60588
## [1] 56914.05 57933.65 58995.44 59739.73 60235.43 60702.74 61641.78
## [1] 148
## [1] 72161 72703 73301 74320 74320 74635 74939
## [1] 71717.46 71800.55 71871.24 71824.51 71920.88 71691.48 71574.67
## [1] 179
## [1] 85830 86269 86638 87042 87522 88116 88810
## [1] 85430.53 85941.25 86258.79 86713.12 87089.87 87528.47 87848.08
## [1] 210
## [1] 104027 104743 105557 106573 107501 108315 109354
## [1] 104081.4 104937.0 106066.5 106769.6 107696.9 108807.4 109669.6
## [1] 241
## [1] 158167 160634 162700 165019 167216 170342 172437
## [1] 163125.6 166869.9 175348.0 178826.3 189746.0 193332.7 203345.2
meanWindowedASE_mlpWAmv <- mean(ASEHolder_mlpWAmv$ase)
medWindowedASE_mlpWAmv <- median(ASEHolder_mlpWAmv$ase)
ASEHolder_mlpWAmv
## window ase train_start train_end test_start test_end
## 1 1 16200.64 1 24 25 31
## 2 2 46398.17 32 55 56 62
## 3 3 105533.98 63 86 87 93
## 4 4 265600.01 94 117 118 124
## 5 5 5003089.56 125 148 149 155
## 6 6 282313.35 156 179 180 186
## 7 7 102764.15 187 210 211 217
## 8 8 343653535.91 218 241 242 248
ASEHolder_mlpWAmv %>% ggplot() +
geom_histogram(
aes(x = window, y = ase, fill = as.factor(window)),
stat = "identity",
show.legend = F
) +
theme_minimal() +
scale_x_discrete(name ="Window",
limits=c("1","2","3","4","5", "6","7","8")) +
ggtitle("ASE values for Model A w/ 8 rolling windows") +
geom_text(mapping = aes(x = 2, y = 30000000), label = paste("Mean ASE Model A: ", round(meanWindowedASE_mlpWAmv,0)), nudge_x = 0.5) +
geom_text(mapping = aes(x = 2, y = 40000000), label = paste("Mean RMSE Model A: ", round(sqrt(meanWindowedASE_mlpWAmv),0)), nudge_x = 0.55)
The rolling window analysis shows that the MLP model did not perform as well as the VAR model above. The VAR model will be used for the final predictions.
X_fin <- data.frame(positive = wa_mod_df$positive, death = wa_mod_df$death, totalTestResults = wa_mod_df$totalTestResults)
wa_var_fin <- VAR(X_fin,p=1, type = "both", lag.max = 15)
wa_var_preds <- predict(wa_var_fin,n.ahead = 90)
wa_plot_df <- data.frame(date = wa_mod_df$date, positive = wa_mod_df$positive, ll = wa_mod_df$positive, ul = wa_mod_df$positive)
wa_plot_df <- rbind(wa_plot_df, data.frame(date = as.Date(future_dates$date, format = "%m/%d/%y"), positive = abs(wa_var_preds$fcst$positive[,1]), ll = abs(wa_var_preds$fcst$positive[,2]), ul = abs(wa_var_preds$fcst$positive[,3])))
wa_plot_df[247:255,] %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
geom_ribbon(aes(ymin = ll, ymax = ul), fill = "red", alpha = 0.1) +
theme_minimal() +
labs(title = "WA Total Positive Cases - 7-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 day", date_labels = "%b %d") +
scale_y_continuous(labels = scales::comma)
The forecast for the next 7 days is still increasing. The confidence intervals are shown in red.
wa_plot_df %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
geom_ribbon(aes(ymin = ll, ymax = ul), fill = "red", alpha = 0.1) +
theme_minimal() +
labs(title = "WA Total Positive Cases - 90-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 month", date_labels = "%b - %y") +
scale_y_continuous(labels = scales::comma) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
The 90-day forecast from the multivariate VAR model does not seem very reliable once the decrease starts. The chart actually depicts the absolute value of the predictions, because the model produced some negative predictions, which obviously cannot occur. The model also forecasts a decrease, but since the cumulative total of cases is being modeled, the series can never decrease.
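Since the series is a running total, one simple post-processing fix, not applied above but worth noting, is to clip the forecast path so it is non-negative and non-decreasing; a sketch in Python with hypothetical values:

```python
def monotone_cumulative(forecasts, last_observed):
    """Post-process forecasts of a cumulative (running-total) series.

    A running total can never fall, so raise each forecast to at least
    the running maximum of the path so far (seeded with the last
    observed total). This guards against models that emit decreasing
    or negative paths, as the VAR did here.
    """
    out, floor = [], last_observed
    for f in forecasts:
        floor = max(floor, f)
        out.append(floor)
    return out

# Hypothetical raw forecast path that dips and even goes negative:
fixed = monotone_cumulative([165000.0, 160000.0, -5000.0, 170000.0],
                            last_observed=162000.0)
```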
X2 <- cbind(us_agg$positive[1:242], us_agg$totalTestResults[1:242])
us_var_slct <- VARselect(X2, lag.max = 15, type = "both")
us_var_slct
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 15 11 10 15
##
## $criteria
## 1 2 3 4 5
## AIC(n) 4.266240e+01 4.096830e+01 4.088177e+01 4.078424e+01 4.061844e+01
## HQ(n) 4.271110e+01 4.104136e+01 4.097918e+01 4.090600e+01 4.076455e+01
## SC(n) 4.278310e+01 4.114936e+01 4.112318e+01 4.108600e+01 4.098055e+01
## FPE(n) 3.373241e+18 6.198946e+17 5.685289e+17 5.157263e+17 4.369677e+17
## 6 7 8 9 10
## AIC(n) 4.056958e+01 4.028079e+01 4.018110e+01 4.018973e+01 4.005607e+01
## HQ(n) 4.074005e+01 4.047561e+01 4.040028e+01 4.043326e+01 4.032395e+01
## SC(n) 4.099204e+01 4.076360e+01 4.072427e+01 4.079324e+01 4.071994e+01
## FPE(n) 4.161781e+17 3.118369e+17 2.823064e+17 2.848233e+17 2.492643e+17
## 11 12 13 14 15
## AIC(n) 4.000989e+01 4.000932e+01 4.000597e+01 3.999360e+01 3.992843e+01
## HQ(n) 4.030212e+01 4.032591e+01 4.034691e+01 4.035889e+01 4.031807e+01
## SC(n) 4.073411e+01 4.079389e+01 4.085089e+01 4.089887e+01 4.089406e+01
## FPE(n) 2.381019e+17 2.380696e+17 2.373942e+17 2.346115e+17 2.199575e+17
The VARselect output for the national multivariate model; BIC (SC) favors p = 10.
us_var <- VAR(X2,p=10, type = "both", lag.max = 15)
us_var_preds <- predict(us_var,n.ahead = 7)
us_var_ase <- mean((us_agg$positive[242:248] - us_var_preds$fcst$y1[,1])^2)
us_var_ase
## [1] 11474555565
plot(seq(1,248,1),us_agg$positive, type = "b", xlim = c(235,250), ylim = c(11800000, 14500000))
points(seq(242,248,1),us_var_preds$fcst$y1[,1],type = "b", pch = 15)
The quick peek looks OK; on to the rolling window.
# Changes made to this loop
windows <- seq(0,248,31)
trainingSize = 25
horizon = 7
df <- us_agg$positive[1:248]
ASEHolder_usVar <- data.frame(
window=integer(),
ase=numeric(),
train_start=integer(),
train_end=integer(),
test_start=integer(),
test_end=integer()
)
i = 2
for (i in 2:9) {
srt <- windows[i] - 30
X_tmp <- cbind(us_agg$positive[1:(windows[i]-horizon)], us_agg$totalTestResults[1:(windows[i]-horizon)])
us_var_tmp <- VAR(X_tmp,p=12, type = "both", lag.max = 15)
us_var_preds_tmp <- predict(us_var_tmp,n.ahead = 7)
print(windows[i]-horizon)
print(df[((windows[i]-horizon) + 1):windows[i]])
print(us_var_preds_tmp$fcst$y1[,1])
ASE <- mean((df[((windows[i]-horizon) + 1):windows[i]] - us_var_preds_tmp$fcst$y1[,1])^2)
ASEHolder_usVar[nrow(ASEHolder_usVar) + 1,] <- c(i-1,
ASE ,
srt,
windows[i] - horizon,
(windows[i]-horizon) + 1,
windows[i]
)
}
## [1] 24
## [1] 942523 969581 992176 1017548 1043774 1073745 1106547
## [1] 946868.8 981570.3 1013339.8 1044870.8 1081860.2 1122449.5 1166014.8
## [1] 55
## [1] 1680724 1700135 1722794 1746382 1769819 1791211 1811573
## [1] 1681730 1700208 1721712 1740712 1760364 1778191 1797133
## [1] 86
## [1] 2452369 2495262 2536594 2576359 2623989 2674990 2728498
## [1] 2453435 2499077 2544528 2592163 2646278 2706640 2774296
## [1] 117
## [1] 4262807 4321791 4385732 4454538 4522068 4582313 4628823
## [1] 4264578 4324271 4389968 4458106 4526627 4584552 4635839
## [1] 148
## [1] 5814369 5860495 5904458 5943743 5975013 6017434 6047651
## [1] 5816265 5861776 5905372 5942980 5976181 6010690 6052678
## [1] 179
## [1] 7048812 7084695 7121136 7165483 7211088 7260346 7311280
## [1] 7049274 7088382 7132785 7175268 7218447 7269857 7318272
## [1] 210
## [1] 8772165 8860888 8958126 9048711 9122560 9204607 9291273
## [1] 8768209 8850112 8941084 9029532 9106337 9176827 9250856
## [1] 241
## [1] 13055778 13191020 13338607 13515360 13711156 13921360 14146191
## [1] 13070773 13210970 13327378 13482822 13617599 13757817 13902040
meanWindowedASE_usVar <- mean(ASEHolder_usVar$ase)
medWindowedASE_usVar <- median(ASEHolder_usVar$ase)
ASEHolder_usVar
## window ase train_start train_end test_start test_end
## 1 1 1245167275 1 24 25 31
## 2 2 71680207 32 55 56 62
## 3 3 560626318 63 86 87 93
## 4 4 16425446 94 117 118 124
## 5 5 11252282 125 148 149 155
## 6 6 62679125 156 179 180 186
## 7 7 494065151 187 210 211 217
## 8 8 13845247729 218 241 242 248
ASEHolder_usVar %>% ggplot() +
geom_histogram(
aes(x = window, y = ase, fill = as.factor(window)),
stat = "identity",
show.legend = F
) +
theme_minimal() +
scale_x_discrete(name ="Window",
limits=c("1","2","3","4","5", "6","7","8")) +
ggtitle("ASE values w/ 8 rolling windows") +
geom_text(mapping = aes(x = 2, y = 3000000000), label = paste("Mean ASE Model A: ", round(meanWindowedASE_usVar,0)), nudge_x = 0.5) +
geom_text(mapping = aes(x = 2, y = 2500000000), label = paste("Mean RMSE Model A: ", round(sqrt(meanWindowedASE_usVar),0)), nudge_x = 0.55)
These results will be compared to the multivariate MLP in the next block of code.
# Changes made to this loop
windows <- seq(0,248,31)
trainingSize = 25
horizon = 7
df <- us_agg$positive[1:248]
ASEHolder_mlpMv <- data.frame(
window=integer(),
ase=numeric(),
train_start=integer(),
train_end=integer(),
test_start=integer(),
test_end=integer()
)
i = 2
set.seed(2)
for (i in 2:9) {
srt <- windows[i] - 30
us_ts_tmp = ts(us_agg$positive[1:(windows[i]-horizon)], frequency = 1)
us_xreg_tmp <- data.frame(totalTestResults = ts(us_agg$totalTestResults[1:(windows[i])]))
us_mlpMv_tmp <- mlp(us_ts_tmp, xreg = us_xreg_tmp, sel.lag = TRUE, allow.det.season = TRUE, hd.max = 'valid', comb = c('median'))
us_mlpMv_f <- forecast(us_mlpMv_tmp, h = 7, xreg = us_xreg_tmp)
# nam <- paste0("forecasts", i)
# assign(nam, forecast(wa_mlpMv_tmp, h = horizon, xreg = wa_xreg))
print(windows[i]-horizon)
print(df[((windows[i]-horizon) + 1):windows[i]])
print(as.numeric(us_mlpMv_f$mean))
ASE <- mean((df[((windows[i]-horizon) + 1):windows[i]] - us_mlpMv_f$mean)^2)
ASEHolder_mlpMv[nrow(ASEHolder_mlpMv) + 1,] <- c(i-1,
ASE ,
srt,
windows[i] - horizon,
(windows[i]-horizon) + 1,
windows[i]
)
}
## [1] 24
## [1] 942523 969581 992176 1017548 1043774 1073745 1106547
## [1] 937032.5 961832.9 985523.1 1010475.8 1038207.8 1071204.9 1107000.6
## [1] 55
## [1] 1680724 1700135 1722794 1746382 1769819 1791211 1811573
## [1] 1684571 1705705 1730266 1754751 1776013 1795265 1814991
## [1] 86
## [1] 2452369 2495262 2536594 2576359 2623989 2674990 2728498
## [1] 2445050 2483227 2524751 2567845 2612159 2655901 2701166
## [1] 117
## [1] 4262807 4321791 4385732 4454538 4522068 4582313 4628823
## [1] 4266214 4329112 4394191 4462927 4533564 4600884 4664088
## [1] 148
## [1] 5814369 5860495 5904458 5943743 5975013 6017434 6047651
## [1] 5813001 5856342 5895270 5929004 5962841 5996987 6031042
## [1] 179
## [1] 7048812 7084695 7121136 7165483 7211088 7260346 7311280
## [1] 7060210 7106973 7150446 7195660 7240787 7286485 7334487
## [1] 210
## [1] 8772165 8860888 8958126 9048711 9122560 9204607 9291273
## [1] 8771199 8854660 8938712 9019588 9099414 9179478 9260636
## [1] 241
## [1] 13055778 13191020 13338607 13515360 13711156 13921360 14146191
## [1] 13084914 13257365 13431019 13608466 13788670 13958696 14132587
meanWindowedASE_mlpMv <- mean(ASEHolder_mlpMv$ase)
medWindowedASE_mlpMv <- median(ASEHolder_mlpMv$ase)
ASEHolder_mlpMv
## window ase train_start train_end test_start test_end
## 1 1 31728002 1 24 25 31
## 2 2 34022144 32 55 56 62
## 3 3 237497648 63 86 87 93
## 4 4 275398319 94 117 118 124
## 5 5 166125010 125 148 149 155
## 6 6 642833362 156 179 180 186
## 7 7 481517735 187 210 211 217
## 8 8 4292390591 218 241 242 248
ASEHolder_mlpMv %>% ggplot() +
geom_histogram(
aes(x = window, y = ase, fill = as.factor(window)),
stat = "identity",
show.legend = F
) +
theme_minimal() +
scale_x_discrete(name ="Window",
limits=c("1","2","3","4","5", "6","7","8")) +
ggtitle("ASE values w/ 8 rolling windows") +
geom_text(mapping = aes(x = 2, y = 3000000000), label = paste("Mean ASE Model A: ", round(meanWindowedASE_mlpMv,0)), nudge_x = 0.5) +
geom_text(mapping = aes(x = 2, y = 2800000000), label = paste("Mean RMSE Model A: ", round(sqrt(meanWindowedASE_mlpMv),0)), nudge_x = 0.55)
The multivariate MLP model actually performs better than the VAR model by rolling window ASE. Even so, the VAR model will be used for the final predictions, as the last window appears to be the real source of the VAR model's inflated ASE.
X_fin <- data.frame(positive = us_agg$positive, totalTestResults = us_agg$totalTestResults)
us_var_fin <- VAR(X_fin, p=10, type = "both", lag.max = 15)
us_var_preds <- predict(us_var_fin,n.ahead = 90)
us_plot_df <- data.frame(date = us_agg$date, positive = us_agg$positive, ll = us_agg$positive, ul = us_agg$positive)
us_plot_df <- rbind(us_plot_df, data.frame(date = as.Date(future_dates$date, format = "%m/%d/%y"), positive = abs(us_var_preds$fcst$positive[,1]), ll = abs(us_var_preds$fcst$positive[,2]), ul = abs(us_var_preds$fcst$positive[,3])))
us_plot_df[247:255,] %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
geom_ribbon(aes(ymin = ll, ymax = ul), fill = "blue", alpha = 0.1) +
theme_minimal() +
labs(title = "US Total Positive Cases - 7-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 day", date_labels = "%b %d") +
scale_y_continuous(labels = scales::comma)
Above is the 7-day prediction with the national multivariate VAR model. The confidence intervals are shown in blue.
us_plot_df %>%
ggplot(mapping = aes(x = date, y = positive)) +
geom_line() +
geom_ribbon(aes(ymin = ll, ymax = ul), fill = "blue", alpha = 0.1) +
theme_minimal() +
labs(title = "US Total Positive Cases - 90-day forecast", x = "Date", y = "Number of Cases") +
scale_x_date(breaks = "1 month", date_labels = "%b - %y") +
scale_y_continuous(labels = scales::comma) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
Above is the 90-day prediction with the national multivariate VAR model; the confidence intervals are shown in blue. This prediction looks much better than the one for the state-level data. This VAR model used only the total number of tests as the additional series.
map <- plot_usmap(data = df2, values = "positive") +
scale_fill_distiller(palette = rev("YlOrRd")) +
labs(title = 'Cumulative COVID-19 Positive Tests - Date: {frame_time}', color = "Cumulative Positive Cases") +
transition_time(date)
animate(map, fps=5)
Just a really cool viz I worked hard to get working! It shows the national progression of the disease by number of positive cases. The more yellow a state appears, the higher its total case count. The highest numbers of cases are in CA, TX, NY, FL, and IL, and there also seems to be an elevated count in the southern East Coast states.